Day 14 - Regular expressions - Classes
– I sat next to you in Mrs. Walsh’s English class!
Groundhog Day (1993)
You were so amazed by the power of regular expressions that you decided to come back and continue
your education! What did you say? I see, your boss forced you to learn regular expressions, but
your real dream is to be an action film star. You should consider taking some acting classes. Speaking
of which, the topic of this lesson is classes in regular expressions, which are not evening courses
but collections of characters.
In the previous chapter we learned how to match a single specific character in a regular expression
and how to match any single character. These two choices are very handy, but often we need to
match one character out of a set, for example the digits 1, 2, or 3, or the letters between “a” and “f”.
Neither of the two syntaxes we discussed last time can provide this sort of match, so we need a new
one.
In a regular expression, the syntax [<characters>] means “any single character in the list”, and it
is exactly what we need in this case. For example
$ cat examples.txt | grep -E "[abc]"
matches a single “a”, a single “b”, or a single “c”. Remember that grep highlights all matching
elements in each line and prints the whole line; use the -o option if you want to print only the
matching part. The line “dog”, for example, is excluded from the output as it doesn’t contain any of the
three letters in the class.
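If you want to try this without the original examples.txt, the following is a minimal sketch that
builds its own small file. The file name sample.txt and its four lines are made up for
illustration; the real examples.txt may contain different lines.

```shell
# Hypothetical sample data (an assumption, not the book's examples.txt).
printf '%s\n' "cat" "dog" "bee" "cab" > sample.txt

# Whole lines that contain at least one of the characters a, b, or c.
# "dog" has none of them, so it is left out.
lines=$(grep -E "[abc]" sample.txt)
echo "$lines"

# With -o, grep prints only the matching parts, one match per line.
matches=$(grep -oE "[abc]" sample.txt)
echo "$matches"

# Remove the temporary sample file.
rm -f sample.txt
```

The first command should print “cat”, “bee”, and “cab”, while the second prints each matched
letter on its own line.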
Classes are especially useful because they allow you to use ranges. For example
$ cat examples.txt | grep -E "[a-z]"
matches any single lowercase letter of the English alphabet. Since grep highlights every match, this
will highlight whole words like “gorilla” and “aardvark”, as they are composed of lowercase letters
only. “Johnny 5”, on the other hand, is not completely highlighted, as the capital “J” and the
number 5 are not matched by the regular expression.
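To see exactly which characters the range picks up, you can combine it with the -o option. This
is a small sketch that feeds the text through a pipe instead of reading examples.txt. Setting
LC_ALL=C is a precaution I am adding here: in some locales a range such as [a-z] may follow the
locale's collation order rather than plain ASCII order, and the C locale avoids that surprise.

```shell
# Collect every character matched by [a-z] in "Johnny 5" and join them.
# LC_ALL=C is an added precaution so the range follows ASCII order.
letters=$(printf '%s\n' "Johnny 5" | LC_ALL=C grep -oE "[a-z]" | tr -d '\n')
echo "$letters"   # ohnny

# A word made of lowercase letters only is matched character by character.
count=$(printf '%s\n' "gorilla" | LC_ALL=C grep -oE "[a-z]" | wc -l | tr -d ' ')
echo "$count"     # 7
```

The capital “J”, the space, and the digit 5 never show up in the output, which is the same thing
the missing highlight tells you.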
If you use regular expressions in an editor to search for strings (I’ll let you discover how your
favourite editor allows you to do this), the syntax [a-z] will match the first lowercase letter in
the text. Repeating the search will find the second one, and so on.
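Editors differ in how they expose search, so here is only a rough imitation of “find” and “find
next” on the command line. It relies on grep -o printing the matches in the order they appear; the
sample text and the use of sed to pick the first and second match are my own choices for the
illustration.

```shell
# Hypothetical text to search in.
text="Johnny 5 is alive"

# grep -o lists the matches in order; sed -n 'Np' keeps only the N-th one.
first=$(printf '%s\n' "$text" | LC_ALL=C grep -oE "[a-z]" | sed -n '1p')
second=$(printf '%s\n' "$text" | LC_ALL=C grep -oE "[a-z]" | sed -n '2p')

echo "first match:  $first"    # o
echo "second match: $second"   # h
```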
Typical ranges are a-z for lowercase letters, A-Z for uppercase ones, and 0-9 for digits. You can use
more than one range in a class, for example